22160 - R for Bio Data Science
The dataset contains 25 features related to chronic kidney disease, collected from 400 individuals in India. In addition to chronic kidney disease (CKD), there is information on co-diagnoses:
Hypertension
Diabetes
Anemia
Pedal edema
Coronary artery disease
Can we identify any physiological markers which are related to a chronic kidney disease diagnosis? If so, which ones?
Data cleaning and augmentation were done using the Tidyverse collection of packages.
Cleaning: Renaming columns and fixing variable types.
Augmenting: Dividing into age groups, splitting and joining tables, estimating glomerular filtration rate (GFR).
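The cleaning and augmenting steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the toy rows are hypothetical, the raw column names (`age`, `sc`, `htn`, `classification`) are assumed to follow the abbreviated headers of the Kaggle file, and the age-group cut points are chosen only for the example.

```r
library(dplyr)

# Hypothetical raw rows; every column arrives as text, as is common for a
# CSV with inconsistent entries. Column names are assumptions.
raw <- tibble(
  id = 1:4,
  age = c("48", "7", "62", "35"),
  sc = c("1.2", "0.8", "1.8", "0.9"),
  htn = c("yes", "no", "no", "no"),
  classification = c("ckd", "ckd", "ckd", "notckd")
)

# Cleaning: descriptive column names and proper variable types.
clean <- raw %>%
  rename(
    serum_creatinine = sc,
    hypertension = htn,
    diagnosis = classification
  ) %>%
  mutate(
    age = as.numeric(age),
    serum_creatinine = as.numeric(serum_creatinine),
    hypertension = factor(hypertension, levels = c("no", "yes")),
    diagnosis = factor(diagnosis, levels = c("notckd", "ckd"))
  )

# Augmenting: age groups (illustrative cut points, intervals closed on the left).
augmented <- clean %>%
  mutate(age_group = cut(
    age,
    breaks = c(0, 18, 40, 60, Inf),
    labels = c("0-17", "18-39", "40-59", "60+"),
    right = FALSE
  ))

# Split into two tables and join them back on the shared key.
labs <- augmented %>% select(id, serum_creatinine)
clinical <- augmented %>% select(-serum_creatinine)
rejoined <- left_join(clinical, labs, by = "id")
```

Keeping an explicit `id` key makes the split-and-join step safe even if the row order changes between the two tables.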
We ran a correlation analysis and fitted a random forest to identify which biomarkers best predict a CKD diagnosis. For this, we used the PerformanceAnalytics and randomForest packages.
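As a rough sketch of this kind of analysis, the block below fits a random forest and ranks predictors by permutation importance. The data are simulated stand-ins, not the CKD measurements; the variable names (`hemo`, `sc`, `noise`), the group means, and the number of trees are all assumptions made for illustration. The correlation matrix is computed with base `cor()`; `chart.Correlation()` from PerformanceAnalytics could plot the same columns.

```r
library(randomForest)

set.seed(42)
n <- 200
diagnosis <- factor(rep(c("ckd", "notckd"), each = n / 2))
is_ckd <- diagnosis == "ckd"

# Simulated stand-in data (NOT the real dataset): two informative markers
# and one pure-noise column.
toy <- data.frame(
  diagnosis = diagnosis,
  hemo = rnorm(n, mean = ifelse(is_ckd, 10.5, 15), sd = 1.5),
  sc = rlnorm(n, meanlog = ifelse(is_ckd, 1, 0), sdlog = 0.4),
  noise = rnorm(n)
)

# Pairwise Pearson correlations between the numeric predictors.
# PerformanceAnalytics::chart.Correlation(toy[, -1], histogram = TRUE)
# would draw these as a scatterplot matrix.
cor_mat <- cor(toy[, c("hemo", "sc", "noise")])

# Random forest classifier; importance = TRUE stores permutation importance.
rf <- randomForest(diagnosis ~ ., data = toy, ntree = 300, importance = TRUE)

# type = 1 selects mean decrease in accuracy; sort from most to least important.
imp <- importance(rf, type = 1)
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
```

Permutation importance (mean decrease in accuracy) is used here because it is easier to interpret than the Gini-based measure; which measure the original analysis reported is not stated.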
Using the CKD-EPI 2021 creatinine equation below, we estimated GFR and assigned each patient a CKD stage. Because the dataset has no sex variable, we averaged the male and female GFR estimates.
\[ \text{eGFR}_{\text{cr}} = 142 \times \min\left(\frac{\text{Scr}}{\kappa},\, 1\right)^{\alpha}\times \max\left(\frac{\text{Scr}}{\kappa},\, 1\right)^{-1.200}\times 0.9938^{\text{Age}}\times 1.012 \;\; \text{[if female]} \]
Here Scr is serum creatinine in mg/dL, \(\kappa\) is 0.7 (female) or 0.9 (male), \(\alpha\) is -0.241 (female) or -0.302 (male), and Age is in years.
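The equation and the sex-averaging step can be sketched in R as below. The function names are hypothetical, the averaging is assumed to be a plain arithmetic mean of the two sex-specific estimates, and the stage cut points are the conventional KDIGO GFR categories, which may differ from the exact thresholds used in the project.

```r
# CKD-EPI 2021 creatinine equation.
# scr: serum creatinine in mg/dL; age: years; female: TRUE or FALSE.
egfr_ckdepi_2021 <- function(scr, age, female) {
  kappa <- ifelse(female, 0.7, 0.9)
  alpha <- ifelse(female, -0.241, -0.302)
  142 *
    pmin(scr / kappa, 1)^alpha *
    pmax(scr / kappa, 1)^(-1.200) *
    0.9938^age *
    ifelse(female, 1.012, 1)
}

# Sex is unknown, so take the mean of the female and male estimates
# (assumption: unweighted arithmetic mean).
egfr_sex_averaged <- function(scr, age) {
  (egfr_ckdepi_2021(scr, age, TRUE) + egfr_ckdepi_2021(scr, age, FALSE)) / 2
}

# Conventional KDIGO GFR categories in mL/min/1.73 m^2 (assumed staging):
# G1 >= 90, G2 60-89, G3a 45-59, G3b 30-44, G4 15-29, G5 < 15.
ckd_stage <- function(egfr) {
  cut(
    egfr,
    breaks = c(-Inf, 15, 30, 45, 60, 90, Inf),
    labels = c("G5", "G4", "G3b", "G3a", "G2", "G1"),
    right = FALSE
  )
}

# Illustrative inputs only, not rows from the dataset.
egfr <- egfr_sex_averaged(scr = c(0.8, 1.8, 6.0), age = c(30, 62, 55))
data.frame(egfr = round(egfr, 1), stage = ckd_stage(egfr))
```

Because the female and male estimates differ by several units at the same creatinine and age, patients near a category boundary may be assigned a different stage than they would be with known sex.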
Hypertension and diabetes were only present in those with a CKD diagnosis.
            diabetes_mellitus
hypertension  no yes
         no  220  31
         yes  41 106

Pearson's Chi-squared test with Yates' continuity correction

data:  tab
X-squared = 144.02, df = 1, p-value < 2.2e-16

[1] "Cramér's V = 0.607"
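The output above can be reproduced from the four cell counts alone. The sketch below is one way to do it in base R; how Cramér's V was computed originally is not shown, but the reported 0.607 matches the value obtained from the uncorrected chi-squared statistic, so that is what is assumed here.

```r
# Contingency table of hypertension (rows) by diabetes mellitus (columns),
# entered column by column from the counts shown above.
tab <- matrix(
  c(220, 41, 31, 106),
  nrow = 2,
  dimnames = list(
    hypertension = c("no", "yes"),
    diabetes_mellitus = c("no", "yes")
  )
)

# For a 2x2 table, chisq.test() applies Yates' continuity correction by default.
yates <- chisq.test(tab)

# Cramér's V from the uncorrected statistic (assumed to match the original):
# V = sqrt(chi^2 / (n * (min(rows, cols) - 1)))
chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
n <- sum(tab)
cramers_v <- sqrt(chi2 / (n * (min(dim(tab)) - 1)))

print(yates)
print(paste0("Cramér's V = ", round(cramers_v, 3)))
```

The table holds 398 patients rather than 400, presumably because two records lack one of the two diagnoses.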
Findings:
CKD is best predicted by albumin in urine, hemoglobin concentration, packed cell volume (PCV), red blood cell (RBC) count, and creatinine levels.
The GFR estimate aligns with the CKD diagnosis.
Hypertension and diabetes are more common in those with CKD.
In this data, many patients had severe CKD.
Caveats and possible improvements:
Data did not need extensive cleaning.
The GFR estimate was made without information on sex, which reduces its accuracy.
More information on the data source is needed to draw more accurate conclusions.
Data:
Chronic Kidney Disease dataset. Kaggle.com. Available: https://www.kaggle.com/datasets/mansoordaku/ckdisease/data
Packages:
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L.D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., et al. (2019). Welcome to the Tidyverse. Journal of Open Source Software 4, 1686. https://doi.org/10.21105/joss.01686.
Peterson, B.G., Carl, P., Boudt, K., Bennett, R., Ulrich, J., Zivot, E., Cornilly, D., Hung, E., Lestel, M., Balkissoon, K., et al. (2024). PerformanceAnalytics: Econometric Tools for Performance and Risk Analysis.
Breiman, L., Cutler, A. (Fortran original), Liaw, A., and Wiener, M. (R port) (2024). randomForest: Breiman and Cutlers Random Forests for Classification and Regression.
Miscellaneous:
CKD-EPI Creatinine Equation (2021) | National Kidney Foundation.
Kaufman, D.P., Basit, H., and Knohl, S.J. (2025). Physiology, Glomerular Filtration Rate. In StatPearls, (Treasure Island (FL): StatPearls Publishing), p.